install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
install.packages("stringr")
install.packages("magrittr")
install.packages("gapminder")
install.packages("nycflights13")MET581 Lecture 04
Wrangling Data 2: Tidying, Strings and Joins
1 Setup
Install all the packages below if you don’t have them already
Then make sure everything is loaded. The datasets we’ll use today are loaded with the packages, apart from the starwars dataset which will need loading as data('starwars').
1.1 Aim of the lecture
Today we’re going to cover some more ground on using the tidyverse to get data into the required format. The aims are to work through:
- Reshaping data with the tidyr package
- Data wrangling using the dplyr package
- String manipulation using the stringr package
1.2 Learning Objectives
- introduce wide and long formats
- begin converting between wide and long data
- join data frames together with dplyr
- introduce manipulating data with strings in stringr
2 A few notes on the homework
2.1 Basic organisation
- By now there should be some structure to your work
- This might be R scripts or literate programming (Quarto files)
- It’s worth creating a new project for each class, and a new .qmd file for each homework
- Test in the console to build up parts of a query and transfer to the .qmd file
2.2 Scoped verbs vs. pick and across
- We mainly talked about applying functions to individual columns
- But we often want to apply functions to multiple columns
- We can use
pick()andacross()to do this - The scoped variants of dplyr verbs like
select_if()have been superseded (but you will see them everywhere!)
2.3 pick() in masked environments
- in functions like
mutate(),summarise(), andgroup_by()we can refer to columns directly by their names instead of needing quotation marks - why is this? - they “mask” the overall data frame, meaning they provide direct access to column names without needing to explicitly reference the full data frame.
- you will see this referred to as a “data masking environment” in the docs
pick()is likeselect(), but can refer to columns directly in a masked environment, e.g.
gapminder |>
mutate(rank = dense_rank(pick(lifeExp, gdpPercap)))2.4 across for applying functions
- You will cover functional programming in detail with
purrrin later lectures - Functional programming refers to functions which can take other functions as arguments for iteration
acrosscan be used in this way to apply functions to multiple columns- see the R4DS book, ch. 26 for more
2.5 across for applying functions
- for example, we can apply a function to every column that’s an integer
- with the more recent
acrosssyntax, this looks like:
gapminder |>
mutate(across(where(is.numeric), round))- this is equivalent to the now superseded “scoped variant” syntax:
gapminder |>
mutate_if(is.numeric, round)2.6 across for applying functions
- one of the more frustrating things is manually writing out calls to summarise
- with
across, the code fo doing something manual, like:
gapminder |>
summarise(
mean_lifeExp = mean(lifeExp),
mean_pop = mean(pop),
mean_gdpPercap = mean(gdpPercap)
)- can now be simplified to:
gapminder |>
summarise(across(c(lifeExp, pop, gdpPercap), mean))2.7 How to plan a query
For smaller queries:
- sketch out the the “bones” of the query with dplyr verbs
- start with from the end and work backwards
For bigger queries:
- manually or electronically sketch out the data flow
- think about the order of operations (ETL)
- break the query up into manageable chunks if it’s long
- test queries on a small batch of data e.g. read n_max=10 rows for testing
3 Wide and long data
One way of thinking about the conversion between tidy and untidy data is the move from wide-to-long format, and vice versa. What we mean by this is that when we have all of our variables in columns, our data typically appears “longer” or “thinner” and stretches downwards along the page. By contrast, if we spread a variable across several columns, the data appears shorter and wider. A common example fo this might be years: storing a single column titled “year” would make the data appear “long”, while moving each year to have its own column would naturally stretch the data out to look “wide”.
The image below shows us an example of this:
Generally, when reshaping data, we usually have one of two problems:
- a variable is spread over several columns
- an observation is spread over several rows
To solve these, we’ll use the pivot_longer() and pivot_wider() functions from the tidyr package.
Before we get started on converting between these formats, however, we should note two things:
- the terms ‘wide’ and ‘long’ are commonplace but sometimes debated, as a dataset may be small enough that it’s in the ‘wide’ style but doesn’t look wide visually, which can be misleading. What’s ‘wide’ and ‘long’ will depend on the dataset. People in the R stats community sometimes prefer to simple saying a dataset has been converted to be ‘wider’ to ‘longer’ as this emphasises that these are relative terms
- the functions we’re about to use replace the older functions
spread()andgather()from the tidyr package. If you’ve never seen these before don’t worry - this is just to make you aware that you will still see these used in the wild. When you do, you can translatespread()aspivot_wider()andgather()aspivot_longer(). If you already usespread()andgather()you can continue to do so, but note they’re no longer under active development.
3.1 wide to long
To solve the ‘variables as columns’ problem, we use tidyr::pivot_longer()
table4a |>
pivot_longer(c(`1999`, `2000`), names_to = "year", values_to = "cases")This is particularly useful as we can also make use of dplyr::select() style notation:
table4a |>
pivot_longer(-country, names_to = "year", values_to = "cases")3.2 long to wide
By contrast, tidyr::pivot_wider() solves the ‘observations over rows’ problem:
table2 |>
pivot_wider(names_from = type, values_from = count)3.3 tidyr::practice()
Using fish_encounters
- change the variable ‘station’ to be column names, and populate with values from ‘seen’
fish_encounters |>
tidyr::pivot_wider(names_from = station, values_from = seen)Using starwars - load with data("starwars")
- pivot all columns ending with ‘color’ to long format as new columns ‘attribute’ and ‘colour’
starwars |>
tidyr::pivot_longer(
dplyr::ends_with('color'),
names_to = 'attribute',
values_to = 'colour'
)Using flights - load with library(nycflights13)
- pivot all columns ending with ‘time’ to long format as new columns ‘time_type’ and ‘time_value’ then select only columns beginning with ‘time’
flights |>
tidyr::pivot_longer(
dplyr::ends_with('time'),
names_to = 'time_type',
values_to = 'time_value'
) |>
dplyr::select(dplyr::starts_with('time'))3.4 tidyr::practice_more()
Using starwars
- replace
tidyr::function()with the correct call to convert the height and mass columns into ‘characteristic’ and ‘measurement’ columns for plotting below
starwars |>
tidyr::pivot_longer(
c(height, mass),
names_to = 'characteristic',
values_to = 'measurement'
) |>
ggplot(aes(characteristic, measurement)) +
geom_jitter()Using weather
- replace
tidyr::function()with the correct call to convert the temp, dewp and humid columns into ‘condition’ and ‘measurement’ columns for plotting below
weather |>
head(1000) |>
tidyr::pivot_longer(
c(temp, dewp, humid),
names_to = 'condition',
values_to = 'measurement'
) |>
ggplot(aes(measurement, condition)) +
geom_jitter(alpha = 0.6)4 Joins
4.1 Relational data
We’ve been introduced to tidy data already, but what of relational data? When our data is spread over several tables, we have relational data. Often this happens is we’re following tidy principles, i.e. we stored different types of observational units in separate tables. Doing this allows for easy storage and data entry, but at analysis time, we often need all the data together. A good example of this is in the nycflights13 package. This contains many tables which are connected. For example, take the tables below:
flightsplanesWe’ve been good data scientists and stored the information separately, but now we want to do some analysis which involves combining information about planes and flights together.
When we’ve stored data like this, we need to know the relations between pairs of tables. We always think of these relationships as being pairs. Even when there are 3 or more tables which are interconnected, they are defined through their pairwise relationships. In the example above, both are related through the ‘tailnum’ column. In fact, the nycflights13 package has a few tables with differing shared columns:
We don’t need to remember any of these, but just be aware that such situations do indeed occur in everyday datasets. It may be that the relations between your tables are one-to-one, one-to-many, many-to-one, or many-to-many, depending on whether one or more observations in one table match up with one or more in another. This will become more clear in practice.
To work with these relations between tables, we need to think about keys.
4.2 Keys
A key is a variable (or set of variables) that uniquely identifies an observation. It is the backbone of each dataset or set of datasets.
You generally have two types of key:
- Primary key: identifiers observations in its own data frame (eg: planes$tailnum)
- Foreign key: identifies observations in another data frame (eg: flights$tailnum)
It is generally good idea to test whether or not you do have a unique primary key for the data frames you are working with, and may help you eliminate duplications in your data:
planes |>
count(tailnum) |>
filter(n > 1)A final point to note on keys is that a table may not have a unique key! If this happens to you, it’s a good idea to make a surrogate key yourself using the row numbers.
4.3 Joining data
Data sets must share at least one column with the same/similar information that you want to join them together on, called a key.
This can be one or multiple columns, but most important is the the key or keys uniquely identify each row.
If not, and there are multiple matches, all combination of the matches will occur.
There are a variety of different joins.
4.4 Inner join
Syntax:
inner_join(dataset1, dataset2, by = "key")
Example:
inner_join(airlines, flights, by = "carrier")Returns:
- Only rows that exist in dataset1 and dataset2
4.5 Left join
Syntax:
left_join(dataset1, dataset2, by = "key")
left_join(dataset1, dataset2, by = c("key1", "key2"))
Example:
left_join(flights, airlines, by = "carrier")Returns:
- All rows from dataset1
- All columns from dataset1 and dataset2
- If a row exists only in dataset1, data is retained.
- If a row exists only in dataset2, data is not retained.
4.6 Joins - Practice
- Use a left_join to join the flights and planes datasets together
flights |>
dplyr::left_join(planes, by = "tailnum")- Combine the airline information from the “airlines” dataset with the flights dataset using left_join, filter for flights from 2013 and select only the columns to do with arrivals
flights |>
dplyr::left_join(airlines, by = "carrier") |>
filter(year == 2013L) |>
select(dplyr::contains('arr_'))- Join the band_members and band_instruments datasets keep all columns and only those people in both
band_members |>
dplyr::inner_join(band_instruments, by = 'name')4.7 Full join
Syntax:
full_join(dataset1, dataset2, by = "key")
Example:
dplyr::full_join(flights, airlines, by = "carrier")Returns:
- All rows in dataset1 and dataset2
- Joining with missing or duplicate keys
- If data is missing from one dataset but is in the other, depending on the type of join you are performing, this can result in missing data (NA values).
4.8 Right join
This is the same as a left join but with the order of datasets switched. In practice, this is very rarely used - you will almost almost use a left, inner or full join.
Syntax:
right_join(dataset1, dataset2, by = "key")
Example:
dplyr::right_join(airlines, flights, by = "carrier")This is equivalent to (but with the columns in a different order):
dplyr::left_join(flights, airlines, by = "carrier")Returns: - All rows from dataset2 - All columns from dataset1 and dataset2
4.9 Joins - More Practice
- Add data from the weather dataset to flights with a left_join
flights |>
dplyr::left_join(weather)- Join the band_members and band_instruments datasets; keep only the people in the right-hand dataset (band_instruments)
band_members |>
dplyr::right_join(band_instruments, by = 'name')- Join the band_members and band_instruments datasets; keep everything
band_members |>
dplyr::full_join(band_instruments, by = 'name')4.10 Final Tips on Joins
- take the time to learn the different types of joins and choose carefully each time
- anti-joins can be useful to find rows that don’t match between datasets, e.g. to exclude rows with IDs that aren’t in an expected ID list
- there is now a relationships argument in
dplyr::join()- use this to give the expected relationships between data frames (it will stop you making many mistakes)
5 Strings
5.1 But what is a string?
- Strings are a sequence of characters which has to be represented in memory in binary
- This was first widely done using ASCII (American Standard Code for Information Interchange)
- But, ASCII only allows for 128 characters, so not nearly enough for all languages
- Now we represent strings using Unicode (which didn’t appear in it’s current form till the 90s!), which allows for a lot (>1m) characters
- UTF-8 is the most common encoding for Unicode - uses a variable number of bytes (8-bit) units to represent characters
- R uses UTF-8 encoding by default (though will use the label “unknown” unless non-ASCII characters appear), and stringr fully supports Unicode
5.2 Strings and regex
- String manipulation is one of the biggest time-savers in data-analysis. It can also be also one of the more difficult aspects you will need to learn. This is due to the use of regular expressions (regex or regexp for short) which is a programming language in itself.
As quoted by R for Data Science:
“When you first look at a regexp, you’ll think a cat walked across your keyboard, but as your understanding improves they will soon start to make sense.”
If you’ve read about regex before, you will also have come across the quotation:
“Some people, when confronted with a problem, think”I know, I’ll use regular expressions.” Now they have two problems.”
As with all sweeping statements, these should be taken with a pinch of salt. However, they do touch on the depth of the topic, and how the flexibility of regex can easily become something you reach for when a simpler, more readable function is available. You may also find yourself frequently ending up in the situation below.
5.3 An infallible procedure for producing a successful regular expression
Partly for these reasons, we will only briefly look at regex in the actual lecture, with some more detailed notes given below. Proficiency with regular expressions is gained the same way as it is when learning any language though: practice. Instead of repeating many exercises here, we will cover some common use-cases. The get more comfortable using them, you should practice regex in one of the many available online tools, such as Regex1, Regex101 and regexone.
5.4 Stringr
Stringr is a set of functions that remove the inconsistencies that are seen with the base R string manipulation functions. They are however, still limited by your ability to use regular expressions.
5.4.1 Regular Expressions
If you’re comfortable with regular expressions, or you’ve already gone through some of the recommended online material for practising them, you can skip ahead to the more applied section on using stringr.
5.4.1.1 The York Notes
- Allow you to match patterns in strings
- Most basic matches an actual chunk of text, e.g. ‘hag’ in ‘hagrid’
- We can match multiple types of characters with
\s,\d,\w,[abc]and[^abc] - Because we’re in R, we have to use
\\instead of\, e.g.\\wor\\d - We can match any character with
. - We can expand these with
?,+,*or{n,m} - We can anchor them to the start
^or the end$ - e.g.
^\\w+_\\d{4}$would match “hagrid_2020”, but not “hagrid_120”
5.4.1.2 A little more detail
This more thorough introduction will use some of the stringr functions that you’ll be introduced to later, namely str_view(). The reason for doing this is that it’s a really useful function that highlights the matched characters. It’s therefore very helpful for teaching, even if it’s a little out of order. All we’ll do with it is str_view(v, pattern), where v is a vector of strings. As with most things, it’s easier if you look at an example, so just jump straight in below.
We’ll start by getting a few vectors of strings to hand for matching:
starwars_films <- starwars |>
filter(name == "Luke Skywalker") |>
select(films) |>
pull() |>
unlist()
starwars_unique_species <- starwars[["species"]] |>
unique()And then we can try matching plain text:
str_view(starwars_films, "of")Moving on from the basics of matching plain text, the next step up in complexity is ., which matches any character (except a newline):
starwars_unique_species <- str_replace_na(starwars_unique_species)
str_view(starwars_unique_species, ".a.")- We can also include anchors
^to match the start of a string$to match the end of a string
# start anchor
str_view(starwars_unique_species, "^H")
# end anchor
str_view(starwars_unique_species, ".i.$")
# specific anchor
abomination_of_nature <- c(
starwars_unique_species,
str_c(starwars_unique_species[1], starwars_unique_species[5])
)
str_view(abomination_of_nature, "^Human$")But now we have a problem:
- How do we match a literal period aka
.?
The following is almost entirely independent to R, but be wary of using backslashes when pattern matching in general.
See the following, from the R for Data Science textbook:
But if “.” matches any character, how do you match the character “.”? You need to use an “escape” to tell the regular expression you want to match it exactly, not use its special behaviour. Like strings, regexps use the backslash, \, to escape special behaviour. So to match an '.', you need the regexp '\.'. Unfortunately this creates a problem. We use strings to represent regular expressions, and \ is also used as an escape symbol in strings. So to create the regular expression \. we need the string "\\."
The "\\" is used in the regular expressions in R. The "\" is an escape character saying "within this string, the following character should be taken as is." Thus, if you want to actually look for "\" you need to escape it. However, when doing regular expressions in R, you have to "double escape" which is why you see "\\". In R regular expressions like these, if you wanted to actually include "\" you would have to use "\\\\".
If we deliberately introduce a ‘.’ into our vector, we can check the match and see that it now does in fact match correctly!
abomination_of_nature[length(abomination_of_nature)] <- str_c(
".",
starwars_unique_species[1],
starwars_unique_species[5]
)
str_view(abomination_of_nature, "\\.HumanHutt$")You can double-check whether you’re following the sequence of escapes by running the examples below and seeing the output of the matches is what you might expect.
x <- c("\\a", "\\.aas", ".a", "a$")
x2 <- "\u00b5"
x3 <- "\""
str_view(x, "\\\\")
str_view(x2, "\\\\")
str_view(x3, "\"")If the output above was confusing for you, then remember that printing a string doesn’t show it’s actual representation in R. To do this, use writeLines(). Running the lines below might help explain the previous matches by showing the character vectors as R sees them.
writeLines(x)
writeLines(x2)
writeLines(x3)However, this also highlights a general issue with regex: even experienced users can be incorrect about their idea of how a complex regular expression is working. For daily practical use of regular expressions therefore, it’s best to stick with two key principles:
- A good rule of thumb is to always start with “\” in your regex and work through the errors.
- Remember that in programming interpretation is just as important as functionality, because you want other people to understand what you have done. Keep it as simple as possible!
The second point also means that if you can write a more readable bit of code without using regex, it’s probably a good idea.
5.4.1.2.1 Special characters
There are a number of special patterns that match more than one character. You’ve already seen ‘.’, which matches any character apart from a newline. There are a few other useful tools:
\w: matches alphanumerics.\d: matches any digit.\s: matches any whitespace (e.g. space, tab, newline).[abc]: matches a, b, or c.[^abc]: matches anything except a, b, or c.\bmatches a word boundary.
Remember, to create a regular expression containing \d or \s, you’ll need to escape the \ for the string, so you’ll type "\\d" or "\\s"
The [] can be useful in creating easier to read code:
str_view(c("abc", "a.c", "a*c", "a c"), "a[.]c")This works for most (but not all) regex metacharacters: $ . | ? * + ( ) [ {. Unfortunately, a few characters have special meaning even inside a character class and must be handled with backslash escapes: ] \ ^ and -
5.4.1.2.2 Repetition
The next step up in power involves controlling how many times a pattern matches:
?: 0 or 1+: 1 or more*: 0 or more
The most useful but potentially dangerous is .*. This allows you to ignore matching a bunch of characters that would take a while to figure out, allowing you to specify a pattern specific to the problem at hand.
For example:
str_view_all(starwars_unique_species, pattern = ".*\\'.*")Be wary that it is a shortcut and therefore might result in unwanted matches. Generally, the more specific you can make your regex, the less likely you’ll accidentally match something you’re not expecting to, so trying to avoid .* is a good idea.
You can also specify the number of matches precisely:
- {n}: exactly n
- {n,}: n or more
- {,m}: at most m
- {n,m}: between n and m
x <- "1888 is the longest year in Roman numerals: MDCCCLXXXVIII"
str_view(x, "C{2,3}")By default these matches are “greedy”: they will match the longest string possible. You can make them “lazy”, matching the shortest string possible by putting a ? after them. This is an advanced feature of regular expressions, but it’s useful to know that it exists:
str_view(x, 'C{2,3}?')str_view(x, 'C[LX]+?')5.4.1.2.3 Grouping and backreferences?
But what if you wanted to extract a part of the string and store/use it somewhere else in the R expression?
This is where you use backreferences like \\1 and \\2 (group number 1 and 2)
str_view(starwars_unique_species, "(.)\\1", match = TRUE)This is generally useful when you have to make complex but repeatable regular expressions:
sentences |>
str_replace("([^ ]+) ([^ ]+) ([^ ]+)", "\\1 \\3 \\2") |>
head(5)5.4.1.3 A little more practice
Yes, it’s still the same link: regexone. If you’ve gone through it - fantastic! You’ll be able to use the stringr package much more effectively. If you haven’t - invest the time now so you don’t have to learn them on-the-fly when you’re midway through an analysis. It’s worth the effort.
5.4.2 Returning to stringr
Now we can get familiar with the more practical applications of stringr for basic string operations, including cleaning and manipulating them.
5.4.2.1 Basic stringr operations
In this final stretch, we’ll go through practical examples of the most-used functions from the stringr package. These basic operations are fairly fundamental things you’ll need to do quite a lot.
check string lengths and counts
my_string <- c(
'Hagrid',
'Hermione',
'Harry.Potter',
'Ronald_Weasley',
'24xHouse Elves'
)
# counts the number of characters in each string
stringr::str_length(my_string)[1] 6 8 12 14 14
# counts the number of matches in a string
stringr::str_count(my_string, 'Ha')[1] 1 0 1 0 0
concatenate (combine) strings
# explicitly naming each column
stringr::str_c('this', 'that', sep = ', ')[1] "this, that"
5.4.2.2 Manipulations with stringr
We can also do some slightly more complicated operations by extracting or replacing aspects of a string.
extract or replace strings
# extract substrings
stringr::str_sub(my_string, 1, 3)[1] "Hag" "Her" "Har" "Ron" "24x"
# replace matches
stringr::str_replace(my_string, '[\\._x]', ' ')[1] "Hagrid" "Hermione" "Harry Potter" "Ronald Weasley"
[5] "24 House Elves"
5.4.2.3 stringr::practice()
- concatenate the strings “day”, “to” and “day”, separated by a hyphen
stringr::str_c("day", "to", "day", sep = '-')Using starwars
- select the hair color column and replace ‘,’ with a ‘/’ (hint: you can pipe a column into
pull()to convert it to a vector for stringr to handle)
starwars |>
pull(hair_color) |>
stringr::str_replace(', ', '/')Using flights
- select columns ending with ‘delay’ and remove the underscore from all column names
flights |>
select(ends_with('delay')) |>
rename_all(stringr::str_replace, '_', ' ')5.4.2.4 More manipulations with stringr
Next we have some more common operations for finding, viewing, sorting and splitting strings.
find or view strings
# return a boolean for matches (alternative to grepl)
stringr::str_detect(my_string, 'Hagrid')# highlight matches
stringr::str_view(my_string, '^\\w')sort and separate strings
# sorting strings
sentences |>
head(1) |>
stringr::str_split(" ")[[1]]
[1] "The" "birch" "canoe" "slid" "on" "the" "smooth"
[8] "planks."
# sorting strings
stringr::str_sort(words[1:10], locale = 'en') [1] "a" "able" "about" "absolute" "accept" "account"
[7] "achieve" "across" "act" "active"
5.4.2.5 stringr::practice()
- split the string “Harry, did you put your name in the Goblet of Fire?” into its components
stringr::str_split("Harry, did you put your name in the Goblet of Fire?", ' ')- use the
boundary("word")function instead of ” ” and compare results
stringr::str_split(
"Harry, did you put your name in the Goblet of Fire?",
boundary("word")
)Using the first line of the sentences dataset
- split by word boundary, convert all to lowercase and sort
sentences |>
head(1) |>
stringr::str_split(boundary("word")) |>
unlist() |>
stringr::str_to_lower() |>
stringr::str_sort()5.4.2.6 Cleaning up with stringr
Lastly, the simple but ever-useful functions for cleaning up our dataframes
stringr::str_to_upper(words[1:10]) [1] "A" "ABLE" "ABOUT" "ABSOLUTE" "ACCEPT" "ACCOUNT"
[7] "ACHIEVE" "ACROSS" "ACT" "ACTIVE"
stringr::str_to_lower(words[1:10]) [1] "a" "able" "about" "absolute" "accept" "account"
[7] "achieve" "across" "act" "active"
stringr::str_to_sentence(words[1:10]) [1] "A" "Able" "About" "Absolute" "Accept" "Account"
[7] "Achieve" "Across" "Act" "Active"
stringr::str_to_title(words[1:10]) [1] "A" "Able" "About" "Absolute" "Accept" "Account"
[7] "Achieve" "Across" "Act" "Active"
sentences |>
head(1) |>
stringr::str_to_sentence()[1] "The birch canoe slid on the smooth planks."
sentences |>
head(1) |>
stringr::str_to_title()[1] "The Birch Canoe Slid On The Smooth Planks."
5.4.2.7 stringr::practice()
Using starwars
- convert the hair_color column values to be Sentence Case
starwars |>
pull(hair_color) |>
stringr::str_to_sentence()- change all columns names to be title case
starwars |>
rename_all(stringr::str_to_title)Using gapminder
- remove any Camel Case from column names (i.e. all to lower case)
gapminder |>
rename_all(stringr::str_to_lower)- convert the continent column values to be all upper case
gapminder |>
pull(continent) |>
stringr::str_to_upper()6 Homework
- Quarto file
Suggested Reading
- R for Data Science, chapters 6, 15, 16, 20
- The docs for dplyr and vignette at
browseVignettes(package = "dplyr") - Also check out the docs for tidyr and stringr
- This blog post by Joel Spolsky on Unicode and character sets